Research Statement Data Cleaning Algorithmic Data-cleaning Techniques

Author

  • Jiannan Wang
Abstract

With the increasing amount of available data, turning raw data into actionable information is a requirement in every field. However, one bottleneck that impedes this process is data cleaning. Data analysts usually spend over half of their time cleaning data that is dirty (inconsistent, inaccurate, missing, and so on) before they even begin to do any real analysis. It is a time-consuming and costly process, but it is necessary for obtaining high-quality answers from dirty data. The problem is further exacerbated in emerging Big Data scenarios, where data volumes are increasing or the data is integrated from a larger variety of sources.

Similar resources

Using well defined tokens in similarity function for record matching in data cleaning techniques

The integration of information is an important area of research in databases. Duplicate elimination, the problem of detecting database records that are approximate (but not exact) duplicates describing the same real-world entity, is an important data cleaning problem. To ensure high data quality, a data warehouse must cleanse data by detecting and eliminating redundant data. Dur...

Academic Statement by Leopoldo Bertossi

(A) Data Management and Business Intelligence. Specific areas of interest and research have been: (a) Inconsistency management in databases. (b) Virtual data integration. (c) Multidimensional databases, in particular semantics problems and their impact on OLAP and data analytics. (d) Peer data exchange. (e) Contexts for data management. (f) Data quality assessment and data cleaning, in particul...

A Data Quality Metric (DQM): How to Estimate the Number of Undetected Errors in Data Sets

Data cleaning, whether manual or algorithmic, is rarely perfect, leaving a dataset with an unknown number of false positives and false negatives after cleaning. In many scenarios, quantifying the number of remaining errors is challenging because our data integrity rules themselves may be incomplete, or the available gold-standard datasets may be too small to extrapolate. As the use of inherently...

The effect of data cleaning on record linkage quality

BACKGROUND: Within the field of record linkage, numerous data cleaning and standardisation techniques are employed to ensure the highest quality of links. While these facilities are common in record linkage software packages and are regularly deployed across record linkage units, little work has been published demonstrating the impact of data cleaning on linkage quality. METHODS: A range of cle...

An Algorithmic Approach to Data Preprocessing in Web Usage Mining

Web usage mining is an area of web mining that deals with extracting interesting knowledge from the logging information produced by web servers. Different data mining techniques can be applied to web usage data to extract user access patterns, and this knowledge can be used in a variety of applications such as system improvement, web site modification, and business intelligence. Web usage minin...

Publication date: 2014